Prevotella multisaccharivorax Sakamoto et al. 2005 is a species of the large genus Prevotella, 
which belongs to the family Prevotellaceae. The species is of medical interest because its 
members are able to cause diseases in the human oral cavity such as periodontitis, root caries 
and others. Although 77 Prevotella genomes have already been sequenced or are targeted for 
sequencing, this is only the second completed genome sequence of a type strain of a species 
within the genus Prevotella to be published. The 3,388,644 bp long genome is assembled in 
three non-contiguous contigs, harbors 2,876 protein-coding and 75 RNA genes and is a part 
of the Genomic Encyclopedia of Bacteria and Archaea project. 



Introduction 



Strain PPPA20ᵀ (= DSM 17128 = JCM 12954) is the 
type strain of Prevotella multisaccharivorax [1]. 
Currently, there are about 50 species placed in the 
genus Prevotella [1]. The species epithet is derived 
from the Latin adjective multus meaning 
'many/much', the Latin noun saccharum meaning 
'sugar' and the Latin adjective vorax meaning 'liking 
to eat', referring to the ability of the species to 
digest a variety of carbohydrates [2]. P. 
multisaccharivorax strain PPPA20ᵀ is considered to 
be an opportunistic pathogen and was isolated 
from subgingival plaque from a patient with chronic 
periodontitis. Five additional strains isolated 
from the human oral cavity were also placed in the 
species P. multisaccharivorax [2]. Culture-independent 
techniques applied to sites affected by endodontic and 
periodontal diseases have recovered a large number of 
sequences that belong to Prevotella and 
Prevotella-like bacteria. Many of those species have 
never been isolated or described [3]. The complex 
microbial community living in the rich ecological 
niche of the human oral cavity and its interaction 
with consumed food will be of lasting interest for 
medical and ecological reasons [4,5]. Here we 
present a summary classification and a set of 
features for P. multisaccharivorax PPPA20ᵀ, together 
with the description of the non-contiguous finished 
genomic sequencing and annotation. 
Classification and features 

A representative genomic 16S rRNA sequence of P. 
multisaccharivorax PPPA20ᵀ was compared using 
NCBI BLAST [6] under default settings (e.g., 
considering only the high-scoring segment pairs (HSPs) 
from the best 250 hits) with the most recent release 
of the Greengenes database [7], and the relative 
frequencies of taxa and keywords (reduced to their 
stem [8]) were determined, weighted by BLAST 
scores. The most frequently occurring genus was 
Prevotella (100.0%) (14 hits in total). Regarding the 
single hit to sequences from members of the 
species, the average identity within HSPs was 100.0%, 
whereas the average coverage by HSPs was 98.0%. 
Regarding the nine hits to sequences from other 
members of the genus, the average identity within 
HSPs was 90.3%, whereas the average coverage by 
HSPs was 66.5%. Among all other species, the one 
yielding the highest score was Prevotella ruminicola 
(AF218618), which corresponded to an identity of 
91.5% and an HSP coverage of 66.3%. (Note that 
the Greengenes database uses the INSDC (= 
EMBL/NCBI/DDBJ) annotation, which is not an 
authoritative source for nomenclature or 
classification.) The highest-scoring environmental sequence 
was AY550995 ('human carious dentine clone 
IDR-CEC-0032'), which showed an identity of 99.8% and 
an HSP coverage of 94.5%. The most frequently 
occurring keywords within the labels of environmental 
samples which yielded hits were 'fecal' (4.4%), 
'beef, cattl' (4.1%), 'anim, coli, escherichia, feedlot, 
habitat, marc, pen, primari, secondari, stec, 
synecolog' (4.0%), 'neg' (2.5%) and 'fece' (2.4%) (236 hits 
in total). The most frequently occurring keywords 
within the labels of environmental samples which 
yielded hits of a higher score than the 
highest-scoring species were 'fece' (7.9%), 'goeldi, marmoset' 
(4.8%), 'microbiom' (4.3%), 'aspect, canal, oral, 
root' (3.9%) and 'rumen' (3.8%) (54 hits in total). 
While some of these keywords correspond to the 
well-known habitat of P. multisaccharivorax, others 
indicate additional habitats related to animals. 
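
The score-weighted keyword frequencies described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the crude suffix-stripping stemmer (a stand-in for the stemming method cited as [8]) and the normalization by the summed BLAST score of all hits are assumptions made for illustration only.

```python
from collections import defaultdict


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase and strip one suffix."""
    word = word.lower()
    for suffix in ("es", "s", "al", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def weighted_keyword_frequencies(hits):
    """Return {stem: percentage}, weighting each hit by its BLAST score.

    `hits` is an iterable of (label, score) pairs. Assumption: every distinct
    stem in a label receives the full score of that hit, and percentages are
    taken relative to the summed score of all hits.
    """
    totals = defaultdict(float)
    score_sum = 0.0
    for label, score in hits:
        score_sum += score
        for stem in {crude_stem(w) for w in label.split()}:
            totals[stem] += score
    if score_sum == 0:
        return {}
    return {stem: 100.0 * s / score_sum for stem, s in totals.items()}


if __name__ == "__main__":
    # Made-up labels and scores, purely for demonstration.
    example_hits = [("human fecal sample", 100.0), ("cattle feces", 300.0)]
    for stem, pct in sorted(weighted_keyword_frequencies(example_hits).items()):
        print(f"{stem}: {pct:.1f}%")
```

Because a label can contain several keywords, the percentages need not sum to 100, which is consistent with the overlapping keyword groups reported in the text.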

Figure 1 shows the phylogenetic neighborhood of P. 
multisaccharivorax in a 16S rRNA based tree. The 
sequences of the four 16S rRNA gene copies in the 
genome differ from each other by up to two 
nucleotides, and differ by up to two nucleotides from the 
previously published 16S rRNA sequence 
AB200414. 

The cells of P. multisaccharivorax generally have the 
shape of rods (0.8 × 2.5-8.3 μm) and occur singly or 
in pairs (Figure 2). They can also form longer 
filaments. P. multisaccharivorax is a Gram-negative, 
non-spore-forming bacterium (Table 1). The 
organism is described as non-motile and only four genes 
associated with motility were identified in the 
genome (see below). P. multisaccharivorax grows well 
at 37°C, is strictly anaerobic and chemoorganotrophic, 
and is able to ferment cellobiose, glucose, glycerol, 
lactose, maltose, mannose, melezitose, raffinose, 
rhamnose, sorbitol, sucrose, trehalose and xylose 
[2]. Acid production from arabinose and salicin is 
variable. The organism does not reduce nitrate or 
produce indole from tryptophan, but it hydrolyzes 
esculin and digests gelatin [2]. Growth of P. 
multisaccharivorax is inhibited by the addition of 20% 
bile. Major fermentation products are succinic and 
acetic acid; isovaleric acid is produced in small 
amounts [2]. Activities of glucose-6-phosphate 
dehydrogenase (G6PDH) and 6-phosphogluconate 
dehydrogenase (6GPDH) were not detected in 
isolates of this species, whereas malate dehydrogenase 
and glutamate dehydrogenase activities were 
detected in all strains. P. multisaccharivorax produces 
acid and alkaline phosphatase, β-galactosidase, α- 
and β-glucosidase, N-acetyl-β-glucosaminidase, 
α-aminofuranosidase and alanine aminopeptidase. 
The organism has no demonstrable esterase (C4), 
esterase lipase (C4), lipase (C4), leucine 
arylamidase, valine arylamidase, cystine arylamidase, 
pyroglutamic acid arylamidase, trypsin, chymotrypsin, 
β-glucuronidase, α-mannosidase, α-fucosidase, 
arginine aminopeptidase, leucine aminopeptidase, 
proline aminopeptidase, tyrosine aminopeptidase, 
phenylalanine aminopeptidase, urease or catalase 
activity [2]. 

Chemotaxonomy 

In contrast to other Prevotella species, all strains of 
P. multisaccharivorax harbor the menaquinones 
MK-12 (40-55%) and MK-13 (40-45%) in large 
amounts, whereas MK-10 (1-3%) and MK-11 
(8-10%) were found only in small amounts [2]. The 
fatty acid pattern for all strains of P. 
multisaccharivorax revealed C18:1 ω9c (21.7%) and C16:0 (12.9%) as 
major compounds as well as iso-C17:0 3-OH (9.2%), 
anteiso-C15:0 (7.8%), C18:2 ω6,9c (7.5%) and iso-C15:0 
(6.4%) in smaller amounts [2]. Additionally, the 
unusual dimethyl acetals were found, with C16:0 
dimethyl aldehyde in the highest amount of 8.2%. 
This clearly distinguishes the species P. 
multisaccharivorax from other related Prevotella species 
[2]. 
Figure 1. Phylogenetic tree highlighting the position of P. multisaccharivorax relative to the type strains of the 
other species within the family. The tree was inferred from 1,425 aligned characters [9,10] of the 16S rRNA gene 
sequence under the maximum likelihood (ML) criterion [11]. Rooting was done initially using the midpoint method 
[12] and then checked for its agreement with the current classification (Table 1). The branches are scaled in terms 
of the expected number of substitutions per site. Numbers adjacent to the branches are support values from 600 
ML bootstrap replicates [13] (left) and from 1,000 maximum parsimony bootstrap replicates [14] (right) if larger 
than 60%. Lineages with type strain genome sequencing projects registered in GOLD [15] are labeled with one 
asterisk, those also listed as 'Complete and Published' should be labeled with two asterisks: P. ruminicola [16] and 
P. melaninogenica (CP002122/CP002123). 
Figure 2. Scanning electron micrograph of P. multisaccharivorax PPPA20ᵀ 



Table 1. Classification and general features of P. multisaccharivorax PPPA20ᵀ according to the MIGS 
recommendations [17] and the NamesforLife database [1]. 



MIGS ID    Property                 Term                                        Evidence code

           Current classification   Domain Bacteria
                                    Species Prevotella multisaccharivorax       TAS [2]
                                    Type strain PPPA20                          TAS [2]
           Gram stain               negative                                    TAS [2]
           Cell shape               rod-shaped                                  TAS [2]
           Motility                 non-motile                                  TAS [2]
           Sporulation              none                                        TAS [2]
           Temperature range        mesophilic                                  TAS [2]
           Optimum temperature      37°C                                        TAS [2]
           Salinity                 physiological                               TAS [2]
MIGS-22    Oxygen requirement       obligately anaerobic                        TAS [2]
           Carbon source            carbohydrates                               TAS [2]
           Energy metabolism        chemoorganotrophic                          TAS [2]
MIGS-6     Habitat                  host, human oral microflora                 TAS [2]
MIGS-15    Biotic relationship      free-living                                 NAS
MIGS-14    Pathogenicity            opportunistic pathogen                      TAS [2]
           Biosafety level          2                                           TAS [24]
           Isolation                subgingival plaque, chronic periodontitis   TAS [2]
MIGS-4     Geographic location      Japan                                       TAS [2]
MIGS-5     Sample collection time   December 9, 2002                            IDA
MIGS-4.1   Latitude                 not reported
MIGS-4.2   Longitude                not reported
MIGS-4.3   Depth                    not reported
MIGS-4.4   Altitude                 not reported

Evidence codes - IDA: Inferred from Direct Assay (first time in publication); TAS: Traceable Author Statement (i.e., 
a direct report exists in the literature); NAS: Non-traceable Author Statement (i.e., not directly observed for the 
living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These 
evidence codes are from the Gene Ontology project [25]. If the evidence code is IDA, the property was directly 
observed by one of the authors or an expert mentioned in the acknowledgements. 
Genome sequencing and annotation 

Genome project history 

This organism was selected for sequencing on the 
basis of its phylogenetic position [26], and is part 
of the Genomic Encyclopedia of Bacteria and 
Archaea project [27]. The genome project is 
deposited in the Genomes On Line Database [15] and 
the complete genome sequence is deposited in 
GenBank. Sequencing, finishing and annotation 
were performed by the DOE Joint Genome 
Institute (JGI). A summary of the project information is 
shown in Table 2. 

Table 2. Genome sequencing project information 



MIGS ID     Property                     Term

MIGS-31     Finishing quality            Non-contiguous finished
MIGS-28     Libraries used               Three genomic libraries: one 454 pyrosequence standard library,
                                         one 454 PE library (10 kb insert size), one Illumina library
MIGS-29     Sequencing platforms         Illumina GAii, 454 GS FLX Titanium
MIGS-31.2   Sequencing coverage          290.0 × Illumina; 48.0 × pyrosequence
MIGS-30     Assemblers                   Newbler version 2.3, Velvet 0.7.63, phrap SPS 4.24
MIGS-32     Gene calling method          Prodigal 1.4, GenePRIMP
            INSDC ID                     AFJE00000000, GL945015-GL945017
            Genbank Date of Release      June 20, 2011
            GOLD ID                      Gi05358
            NCBI project ID              41513
            Database: IMG-GEBA           2503754046
MIGS-13     Source material identifier   DSM 17128
            Project relevance            Tree of Life, GEBA



Growth conditions and DNA isolation 

P. multisaccharivorax PPPA20ᵀ, DSM 17128, was 
grown anaerobically in DSMZ medium 104 
(PYG-medium) [28] at 37°C. DNA was isolated from 
0.5-1 g of cell paste using MasterPure Gram-positive 
DNA purification kit (Epicentre MGP04100) 
following the standard protocol as recommended by 
the manufacturer with modification st/DL for cell 
lysis as described in Wu et al. 2009 [27]. DNA is 
available through the DNA Bank Network [29]. 

Genome sequencing and assembly 

The genome was sequenced using a combination 
of Illumina and 454 sequencing platforms. All 
general aspects of library construction and 
sequencing can be found at the JGI website [30]. 
Pyrosequencing reads were assembled using the 
Newbler assembler (Roche). The initial Newbler 
assembly consisting of 154 contigs in five scaffolds 
was converted into a phrap [31] assembly by 
making fake reads from the consensus, to collect the 
read pairs in the 454 paired end library. Illumina 
GAii sequencing data (1,043.6 Mb) was assembled 
with Velvet [32] and the consensus sequences 
were shredded into 2.0 kb overlapped fake reads 
and assembled together with the 454 data. The 
454 draft assembly was based on 135.4 Mb 454 
standard data and all of the 454 paired end data. 
Newbler parameters are -consed -a 50 -l 350 -g -m 
-ml 20. The Phred/Phrap/Consed software 
package [31] was used for sequence assembly and 
quality assessment in the subsequent finishing 
process. After the shotgun stage, reads were 
assembled with parallel phrap (High Performance 
Software, LLC). Possible mis-assemblies were 
corrected with gapResolution [30], Dupfinisher [33], 
or sequencing cloned bridging PCR fragments with 
subcloning or transposon bombing (Epicentre 
Biotechnologies, Madison, WI). Gaps between 
contigs were closed by editing in Consed, by PCR and 
by Bubble PCR primer walks (J.-F. Chang, 
unpublished). A total of 218 additional reactions were 
necessary to close gaps and to raise the quality of 
the finished sequence. Illumina reads were also 
used to correct potential base errors and increase 
consensus quality using the software Polisher 
developed at JGI [34]. 
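
The shredding of consensus sequences into overlapping fake reads can be sketched as follows. This is an illustration only, not the JGI tooling: the function name `shred` is hypothetical, and the 1 kb overlap is an assumption, since the text gives the read length (2.0 kb) but not the overlap.

```python
def shred(consensus, read_len=2000, overlap=1000):
    """Cut a consensus sequence into overlapping fragments ("fake reads").

    read_len follows the 2.0 kb stated in the text; the overlap value is an
    assumption made for this sketch. The last fragment may be shorter than
    read_len so that the end of the sequence is always covered.
    """
    if read_len <= 0 or not 0 <= overlap < read_len:
        raise ValueError("need read_len > 0 and 0 <= overlap < read_len")
    if not consensus:
        return []
    step = read_len - overlap
    reads = []
    start = 0
    while True:
        reads.append(consensus[start:start + read_len])
        if start + read_len >= len(consensus):
            break
        start += step
    return reads


if __name__ == "__main__":
    # Toy sequence with small window sizes so the output is easy to inspect.
    for fragment in shred("ACGTACGTAC", read_len=4, overlap=2):
        print(fragment)
```

Overlapping fragments give the downstream assembler enough shared sequence to rejoin neighboring pieces of the same Velvet contig while still treating them as ordinary reads.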

The error rate of the completed genome sequence 
is less than 1 in 100,000. Together, the combination 
of the Illumina and 454 sequencing platforms 
provided 338 × coverage of the genome. The final 
assembly contained 325,939 pyrosequence and 
28,989,384 Illumina reads. 
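
As a quick worked check of these figures, the short Python sketch below adds the per-platform coverage values listed in Table 2 and converts the stated error-rate bound into an upper limit on the expected number of erroneous bases. The variable names are illustrative only.

```python
# Values taken from the text and from Table 2.
GENOME_SIZE_BP = 3_388_644
ILLUMINA_COVERAGE = 290.0      # fold coverage
PYROSEQ_COVERAGE = 48.0        # fold coverage
MAX_ERROR_RATE = 1 / 100_000   # "less than 1 in 100,000"

# Combined coverage of both platforms.
total_coverage = ILLUMINA_COVERAGE + PYROSEQ_COVERAGE

# Upper bound on the expected number of erroneous bases in the genome.
max_expected_errors = GENOME_SIZE_BP * MAX_ERROR_RATE

print(f"combined coverage: {total_coverage:.0f} x")
print(f"expected erroneous bases: fewer than {max_expected_errors:.1f}")
```

The sum reproduces the 338 × coverage quoted above, and the error-rate bound corresponds to fewer than about 34 erroneous bases over the whole genome.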
Genome annotation 

Genes were identified using Prodigal [35] as part 
of the Oak Ridge National Laboratory genome 
annotation pipeline, followed by a round of manual 
curation using the JGI GenePRIMP pipeline [36]. 
The predicted CDSs were translated and used to 
search the National Center for Biotechnology 
Information (NCBI) non-redundant database and the 
UniProt, TIGRFam, Pfam, PRIAM, KEGG, COG, and 
InterPro databases. Additional gene prediction 
analysis and functional annotation were performed 
within the Integrated Microbial Genomes - Expert 
Review (IMG-ER) platform [37]. 



Genome properties 

The assembled genome sequence consists of three 
non-contiguous contigs with lengths of 3,334,154 
bp, 47,474 bp and 7,016 bp and a G+C content of 
48.3% (Figure 3 and Table 3). Of the 2,951 genes 
predicted, 2,876 were protein-coding genes and 
75 were RNA genes; 166 pseudogenes were also identified. 
The majority of the protein-coding genes (60.5%) 
were assigned a putative function, while the 
remaining ones were annotated as hypothetical 
proteins. The distribution of genes into COG 
functional categories is presented in Table 4. 
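
The figures in this paragraph can be cross-checked against the genome size given in the abstract with a few lines of Python. This is only a consistency check on the reported numbers; the helper `pct` is a hypothetical convenience function.

```python
# Numbers as reported in the text.
contigs_bp = [3_334_154, 47_474, 7_016]
protein_coding = 2_876
rna_genes = 75
pseudogenes = 166


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100 * part / whole, 2)


genome_size = sum(contigs_bp)            # total assembled length in bp
total_genes = protein_coding + rna_genes

print(f"genome size: {genome_size:,} bp")
print(f"total genes: {total_genes:,}")
print(f"protein-coding: {pct(protein_coding, total_genes)}%")
print(f"RNA genes: {pct(rna_genes, total_genes)}%")
print(f"pseudogenes: {pct(pseudogenes, total_genes)}%")
```

The three contig lengths add up to the 3,388,644 bp stated in the abstract, and the protein-coding and RNA genes add up to the 2,951 predicted genes.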
Figure 3. Graphical map of the largest scaffold. From outside to the center: Genes on forward strand 
(color by COG categories), Genes on reverse strand (color by COG categories), RNA genes (tRNAs 
green, rRNAs red, other RNAs black), GC content, GC skew. 
Table 3. Genome Statistics 

Attribute                          Value     % of total

DNA G+C content (bp)                         48.31%
Number of scaffolds
Total genes                        2,951     100.00%
RNA genes                          75        2.54%
rRNA operons                       4
Protein-coding genes               2,876     97.46%
Pseudo genes                       166       5.63%
Genes in paralog clusters                    [illegible]
Genes assigned to COGs                       [illegible]
Genes assigned Pfam domains        1,864     63.17%
Genes with signal peptides         782       26.50%
Genes with transmembrane helices   588       19.93%
CRISPR repeats                     3



Table 4. Number of genes associated with the general COG functional categories 



Code   Value   %age   Description

J      138     7.7    Translation, ribosomal structure and biogenesis
A      0       0.0    RNA processing and modification
K      102     5.7    Transcription
L      183     10.1   Replication, recombination and repair
B      0       0.0    Chromatin structure and dynamics
D      26      1.4    Cell cycle control, cell division, chromosome partitioning
Y      0       0.0    Nuclear structure
V      46      2.6    Defense mechanisms
T      63      3.5    Signal transduction mechanisms
M      155     8.6    Cell wall/membrane/envelope biogenesis
N      4       0.2    Cell motility
Z      0       0.0    Cytoskeleton
W      0       0.0    Extracellular structures
U      31      1.7    Intracellular trafficking, secretion, and vesicular transport
O      69      3.8    Posttranslational modification, protein turnover, chaperones
C      90      5.0    Energy production and conversion
G      145     8.0    Carbohydrate transport and metabolism
E      132     7.3    Amino acid transport and metabolism
F      59      3.3    Nucleotide transport and metabolism
H      74      4.1    Coenzyme transport and metabolism
I      56      3.1    Lipid transport and metabolism
P      120     6.7    Inorganic ion transport and metabolism
Q      27      1.5    Secondary metabolites biosynthesis, transport and catabolism
R      202     11.2   General function prediction only
S      82      4.6    Function unknown
-      1,292   43.8   Not in COGs